Note: Report has been split into two due to the size of the dataset and weight of visuals.
Photography has been a hobby of mine for most of my life, and I found a particular niche in abstract photography, specifically multi-exposure images. This background inspired me to find mathematical ways to analyze my photo library as a whole, with a special focus on color trends and affinities.
The processes used in this project have a business application within a mobile app. By evaluating a user’s camera roll, the app could discern favorite colors and suggest products that match that color profile.
Note:
Shout out to my buddy Phil! He was not only a great resource for feedback and encouragement as I formulated my processing script, but he also donated runtime on his computer and processed 250 of the images used in this dataset.
A photograph with 4000 pixels may have 4000 different color values represented. I wanted to “clump” pixels with similar colors in the same area of an image into a single color value. PyMeanShift accomplishes this by taking in the image and three numerical variables: spatial radius, range radius, and minimum density. These refer to maximum placement difference, maximum color difference, and minimum “clump” size, respectively.
FlickrID - unique identifier for each photo
DateTimeOriginal/CreateDate/ModifyDate - attempted to capture whether the images were edited on the phone (unsuccessful)
Software - iOS version or mobile app used for photo capture
LensInfo/LensModel - data on which phone lens captured the photo
JFIFVersion - compression marker applied by some third-party apps. Disappears when the image is edited in the native iOS Photos app.
ISO - light sensitivity setting
ExposureTime - in seconds (fractions)
FNumber - aperture
FocalLength - fixed to LensInfo/LensModel
FocalLengthIn35mmFormat - iOS interpretation of zoom level
BrightnessValue - Auto-generated brightness value
SubjectArea - Coordinate values generated by iOS (not directly relevant to this project, but captured for future use)
A Python class gathered the image data as attributes, which were then dumped to a CSV with vars().
All relevant attributes/variables are described below.
using_id - Flickr ID
img_width - in pixels
img_height - in pixels
do_img_at - timestamp for evaluating processing time
sub_img - 0 for whole image, 1 for top-left, 2 for middle-left, 5 for center, etc.
full_id - concat of flickrID and sub_image to form unique identifier.
RGB Overview Statistics
(r/g/b)_min - (3 columns) Minimum red/green/blue channel value in the whole image
(r/g/b)_max - (3 columns) Maximum red/green/blue channel value
(r/g/b)_mean - (3 columns) Average red/green/blue channel value
(r/g/b)_mode - (3 columns) Count of the most common red/green/blue channel value. I forgot to capture the value itself :facepalm: (in my attempt to capture it, I neglected to reset the index of the pandas Series)
center_rgb - (tuple) R/G/B value of the pixel mathematically in the center of the image
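The missing mode value could be recovered alongside its count. The original processing was done in Python, so the following is only a minimal R sketch of the idea; the function name `channel_mode` and the sample values are made up for illustration.

```r
# Return both the most common channel value and how often it occurs.
# Ties are resolved in favor of the smallest value, because table() sorts
# its names and which.max() returns the first maximum.
channel_mode <- function(values) {
  counts <- table(values)
  top <- which.max(counts)
  c(value = as.numeric(names(counts)[top]), count = as.integer(counts[top]))
}

# Hypothetical red-channel values for six pixels
red_mode <- channel_mode(c(12, 40, 40, 40, 255, 255))
red_mode
```

Keeping the value and the count together avoids the situation described above, where only the count survived.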
Next segment of columns captured from segmented/posterized image
post_num_regions - number of color “clumps” after processing
post_top_hsl - (tuple) most common pixel value
post_top_count - quantity of most common pixel value
post_(2-6)_hsl - (5 columns)(tuples) next most common pixel values, in descending order of frequency
post_(2-6)_count - (5 columns) counts for their respective common pixel values
center_hsl - (tuple) HSL value of the pixel mathematically in the center of the image
Hue color banding was done by subjective eyeball measurement
All hues: red, orange, yellow, green, cyan, blue, purple, magenta
full_(hue)_count - count of all pixels that fell within the hue band, regardless of saturation and lightness
visib_(hue)_count - count of pixels in the hue band deemed as “visibly [hue]” (saturation over 40%, lightness between 20% and 75%)
vivid_(hue)_count - count of pixels in the hue band deemed as “vividly [hue]” (saturation over 70%, lightness between 30% and 70%)
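The "visibly" and "vividly" thresholds above can be sketched as a small R function. This is an illustration only: the original counts were produced in Python, the function name `hue_band_flags` is hypothetical, saturation and lightness are assumed to be on a 0-100 scale, and treating "between" as inclusive is also an assumption. The hue band boundaries themselves are not reproduced here because they were chosen by eye.

```r
# Flag pixels as "visibly" and/or "vividly" colored from saturation and lightness.
# Thresholds come from the column descriptions; the 0-100 scale is an assumption.
hue_band_flags <- function(sat, light) {
  data.frame(
    visible = sat > 40 & light >= 20 & light <= 75,
    vivid   = sat > 70 & light >= 30 & light <= 70
  )
}

# Four made-up pixels: vivid, visible only, too gray, too bright
flags <- hue_band_flags(sat = c(85, 55, 20, 90), light = c(50, 25, 50, 90))
colSums(flags)
```

With these thresholds every vivid pixel is also a visible one, so a vivid count can never exceed the visible count for the same hue band.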
Saturation Statistics
sat_min_val - lowest saturation value in image
sat_25_val - 25% quartile value
sat_50_val - median saturation
sat_75_val - 75% quartile value
sat_max_val - highest saturation value in image
HSL Mean Values
hue_mean_val - average hue value (not incredibly meaningful on a looping spectrum)
sat_mean_val - average saturation value
light_mean_val - average brightness
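To illustrate why an arithmetic mean of hue is not very meaningful on a looping spectrum, here is a small R sketch comparing it with a circular mean. It assumes hue is measured in degrees from 0 to 360; the function name `circular_mean_hue` is hypothetical and this is not how `hue_mean_val` was computed.

```r
# Circular mean: average the unit vectors for each hue angle,
# then convert the resulting direction back to degrees.
circular_mean_hue <- function(hue_deg) {
  rad <- hue_deg * pi / 180
  mean_angle <- atan2(mean(sin(rad)), mean(cos(rad)))
  (mean_angle * 180 / pi) %% 360
}

# Two reddish hues on either side of the 0/360 wrap-around point
hues <- c(340, 40)
arithmetic <- mean(hues)            # 190, which lands in the cyan/blue range
circular <- circular_mean_hue(hues) # approximately 10, a red hue
c(arithmetic = arithmetic, circular = circular)
```

The arithmetic mean of two reds lands on the opposite side of the color wheel, while the circular mean stays between them.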
Lightness Statistics
light_max_val - brightest value
light_max_count - quantity of pixels within 1.5% (literal) of the max lightness value
light_min_val - darkest value
light_min_count - quantity of pixels within 1.5% (literal) of the minimum lightness value (darkest)
light_25_value - 25% quartile value
light_50_value - median brightness
light_75_value - 75% quartile value
gen_bright_count - quantity of pixels with over 85% lightness
gen_dark_count - quantity of pixels with under 15% lightness
common_hsl_(1-4)_val - (4 columns)(tuple) four most common HSL values
common_hsl_(1-4)_count - (4 columns) quantities of the four most common HSL values
Due to collecting image processing data on multiple computers, multiple files were created for exif and image data, partly by design and partly due to occasional read/write conflicts on shared files. All records were gathered into Excel and checked for duplicates before exporting as CSVs.
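The collation was done in Excel, but the same step could be sketched in R. The example below is a minimal illustration under assumptions: the two data frames stand in for partial exports from different machines, their values are made up, and `full_id` is assumed to identify a record uniquely.

```r
library(dplyr)

# Two hypothetical partial exports, as if produced on different computers
batch_a <- data.frame(full_id = c(1010, 1011, 1012), r_mean = c(120.5, 98.2, 143.0))
batch_b <- data.frame(full_id = c(1012, 1020), r_mean = c(143.0, 77.7))

# Stack the batches, then keep the first row for each full_id
combined <- bind_rows(batch_a, batch_b) %>%
  distinct(full_id, .keep_all = TRUE)

nrow(combined)
```

With real files, each batch would come from `read_csv()` rather than being typed in.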
\(H_0:\) There is no correlation between time of year and color values
\(H_a:\) Warm color values are more prominent between May and September
(Additional hypotheses noted in a postscript in Part 2)
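One way the alternative hypothesis could be turned into a measurable quantity is sketched below. Everything here is an assumption for illustration: the rows are synthetic, "warm" is taken to mean the red, orange, and yellow hue bands, and the comparison is a plain difference of means rather than a formal test.

```r
# Synthetic rows for illustration only; these are not values from the dataset
toy <- data.frame(
  month             = c(1, 6, 7, 12),
  full_red_count    = c(10, 40, 30, 5),
  full_orange_count = c(5, 20, 30, 5),
  full_yellow_count = c(5, 20, 20, 10),
  total_pixels      = c(100, 100, 100, 100)
)

# Assumption: "warm" means the red, orange, and yellow hue bands
toy$warm_share <- (toy$full_red_count + toy$full_orange_count + toy$full_yellow_count) /
  toy$total_pixels

# TRUE for May through September, the window named in the alternative hypothesis
toy$in_season <- toy$month >= 5 & toy$month <= 9

# Mean warm share inside and outside the window
season_means <- tapply(toy$warm_share, toy$in_season, mean)
season_means
```

Using a share of total pixels rather than raw counts keeps images of different sizes comparable.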
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
exif <- read_csv("capstone_exif.csv")
Rows: 1116 Columns: 15
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (10): DateTimeOriginal, CreateDate, ModifyDate, Software, LensInfo, LensModel, ExposureTime, FocalLength, FocalLengthIn3...
dbl (5): FlickrID, JFIFVersion, ISO, FNumber, BrightnessValue
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
img_data <- read_csv("capstone_img_data.csv")
Rows: 10955 Columns: 87
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (16): flickr, img_loc, the_image, crop_coords, center_rgb, post_top_hsl, post_2_hsl, post_3_hsl, post_4_hsl, post_5_hsl,...
dbl (70): using_id, img_width, img_height, do_img_at, sub_img, full_id, r_min, r_max, r_mean, r_mode, g_min, g_max, g_mean, ...
num (1): vivid_count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(exif)
cols(
FlickrID = col_double(),
DateTimeOriginal = col_character(),
CreateDate = col_character(),
ModifyDate = col_character(),
Software = col_character(),
LensInfo = col_character(),
LensModel = col_character(),
JFIFVersion = col_double(),
ISO = col_double(),
ExposureTime = col_character(),
FNumber = col_double(),
FocalLength = col_character(),
FocalLengthIn35mmFormat = col_character(),
BrightnessValue = col_double(),
SubjectArea = col_character()
)
spec(img_data)
cols(
flickr = col_character(),
using_id = col_double(),
img_loc = col_character(),
the_image = col_character(),
img_width = col_double(),
img_height = col_double(),
crop_coords = col_character(),
do_img_at = col_double(),
sub_img = col_double(),
full_id = col_double(),
r_min = col_double(),
r_max = col_double(),
r_mean = col_double(),
r_mode = col_double(),
g_min = col_double(),
g_max = col_double(),
g_mean = col_double(),
g_mode = col_double(),
b_min = col_double(),
b_max = col_double(),
b_mean = col_double(),
b_mode = col_double(),
center_rgb = col_character(),
post_num_regions = col_double(),
post_top_hsl = col_character(),
post_top_count = col_double(),
post_2_hsl = col_character(),
post_2_count = col_double(),
post_3_hsl = col_character(),
post_3_count = col_double(),
post_4_hsl = col_character(),
post_4_count = col_double(),
post_5_hsl = col_character(),
post_5_count = col_double(),
post_6_hsl = col_character(),
post_6_count = col_double(),
center_hsl = col_character(),
full_red_count = col_double(),
visib_red_count = col_double(),
vivid_red_count = col_double(),
full_orange_count = col_double(),
visib_orange_count = col_double(),
vivid_orange_count = col_double(),
full_yellow_count = col_double(),
visib_yellow_count = col_double(),
vivid_yellow_count = col_double(),
full_green_count = col_double(),
visib_green_count = col_double(),
vivid_green_count = col_double(),
full_cyan_count = col_double(),
visib_cyan_count = col_double(),
vivid_cyan_count = col_double(),
full_blue_count = col_double(),
visib_blue_count = col_double(),
vivid_blue_count = col_double(),
full_purple_count = col_double(),
visib_purple_count = col_double(),
vivid_purple_count = col_double(),
full_mag_count = col_double(),
visib_mag_count = col_double(),
vivid_mag_count = col_double(),
vivid_count = col_number(),
sat_min_val = col_double(),
sat_25_val = col_double(),
sat_50_val = col_double(),
sat_75_val = col_double(),
sat_max_val = col_double(),
hue_mean_val = col_double(),
sat_mean_val = col_double(),
light_mean_val = col_double(),
light_max_val = col_double(),
light_max_count = col_double(),
light_min_val = col_double(),
light_min_count = col_double(),
light_25_value = col_double(),
light_50_value = col_double(),
light_75_value = col_double(),
gen_bright_count = col_double(),
gen_dark_count = col_double(),
common_hsl_1_val = col_character(),
common_hsl_1_count = col_double(),
common_hsl_2_val = col_character(),
common_hsl_2_count = col_double(),
common_hsl_3_val = col_character(),
common_hsl_3_count = col_double(),
common_hsl_4_val = col_character(),
common_hsl_4_count = col_double()
)
Pre Import:
Sub-image data for main images (0) was bugged in the first hours of
image processing. This was fixed in Excel during the data collation
stage.
[Editor's Note: numeric values looked consistent in Excel and RStudio, but were interpreted as characters, necessitating a large amount of manual re-typing in Part 2. At the time it seemed faster than backing up and figuring out the ‘right’ way to fix them.]
EXIF:
names(exif) <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", names(exif))
names(exif) <- names(exif) %>% tolower()
exif_tidy <- select(exif, -c(date_time_original, modify_date, lens_info, fnumber, focal_length))
exif_tidy <- replace_na(exif_tidy, list(subject_area = "0 0 0 0", jfifversion = 0))
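To see what the two renaming lines at the top of this block do, the same pattern can be applied to a handful of the EXIF column names. This is just a demonstration on a character vector; the variable names `cols` and `snake` are made up.

```r
# A few of the original EXIF column names
cols <- c("FlickrID", "DateTimeOriginal", "JFIFVersion", "FNumber",
          "FocalLengthIn35mmFormat")

# Insert an underscore wherever a lowercase letter or digit is followed
# by an uppercase letter, then lowercase everything
snake <- tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", cols))
snake
```

Names made of consecutive capitals, such as JFIFVersion and FNumber, get no underscore, which is why the later code refers to `jfifversion` and `fnumber`.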
img_data:
imgsd_tidy <- select(img_data, -c(flickr, img_loc, the_image, img_width, img_height, crop_coords, do_img_at, r_mode, b_mode, g_mode))
imgsd_tidy <- replace_na(imgsd_tidy, list(
post_2_hsl = "(-1, -1, -1)",
post_3_hsl = "(-1, -1, -1)",
post_4_hsl = "(-1, -1, -1)",
post_5_hsl = "(-1, -1, -1)",
post_6_hsl = "(-1, -1, -1)"
)
)
EXIF
exif_tidy <- exif_tidy %>% separate(create_date, into = c('date', 'time'), sep = " ", remove = TRUE) %>% separate(date, into = c('year', 'month', 'day'), sep = ":")
exif_tidy$date <- as.Date(paste("1881", exif_tidy$month, exif_tidy$day, sep = "-"), format ="%Y-%m-%d")
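Pinning every photo to the same year makes dates comparable by time of year, but the choice of year matters. The sketch below uses made-up month and day values to show that a non-leap year such as 1881 turns February 29 into NA, while a leap year (2000 is used here purely as an example) keeps it.

```r
# Hypothetical month/day strings in the shape produced by separate()
month <- c("02", "02", "07")
day   <- c("28", "29", "04")

# 1881 is not a leap year, so February 29 cannot be parsed and becomes NA
d1881 <- as.Date(paste("1881", month, day, sep = "-"), format = "%Y-%m-%d")

# Any leap year keeps all three dates
d2000 <- as.Date(paste("2000", month, day, sep = "-"), format = "%Y-%m-%d")

data.frame(month, day, d1881, d2000)
```

If the library contains no photos taken on February 29, the two approaches give the same ordering.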
Img_data
Count by flickr id
subimg_qty <- imgsd_tidy %>% count(using_id)
To my (happy) surprise, only 5 images have fewer than 10 results and only 2 have fewer than 6. In the interest of time, I’m simply removing the sparse images from my working data; the filter below keeps every ID with at least 6 results.
good_ids <- subimg_qty[subimg_qty$n >=6, "using_id"]
imgsd_tidy <- imgsd_tidy %>% filter(using_id %in% good_ids$using_id)
exif_tidy <- exif_tidy %>% filter(flickr_id %in% good_ids$using_id)
imgsd_tidy <- imgsd_tidy %>% mutate(total_pixels = full_red_count +
full_orange_count +
full_yellow_count +
full_green_count +
full_cyan_count +
full_blue_count +
full_purple_count +
full_mag_count)
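# remove brackets from string-encapsulated lists
# (str_replace_all, because str_replace only removes the first match)
imgsd_tidy$center_hsl <- str_replace_all(imgsd_tidy$center_hsl, '\\[|\\]', '')
# remove parens from string-encapsulated tuples
# note: gsub() returns character, so mutate_all() turns every column,
# numeric ones included, into character
imgsd_tidy <- imgsd_tidy %>% mutate_all(~ gsub('\\(|\\)', '', .))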
# separate pixel values into individual columns
imgsd_split <- imgsd_tidy %>%
separate(center_rgb,
into = c('center_r', 'center_g', 'center_b'),
sep = ',') %>%
separate(post_top_hsl,
into = c('post_top_hue', 'post_top_sat', 'post_top_light'),
sep = ',') %>%
separate(post_2_hsl,
into = c('post_2_hue', 'post_2_sat', 'post_2_light'),
sep = ',') %>%
separate(post_3_hsl,
into = c('post_3_hue', 'post_3_sat', 'post_3_light'),
sep = ',') %>%
separate(post_4_hsl,
into = c('post_4_hue', 'post_4_sat', 'post_4_light'),
sep = ',') %>%
separate(post_5_hsl,
into = c('post_5_hue', 'post_5_sat', 'post_5_light'),
sep = ',') %>%
separate(post_6_hsl,
into = c('post_6_hue', 'post_6_sat', 'post_6_light'),
sep = ',') %>%
separate(center_hsl,
into = c('center_hue', 'center_sat', 'center_light'),
sep = ',') %>%
separate(common_hsl_1_val,
into = c('common_hsl_1_hue', 'common_hsl_1_sat', 'common_hsl_1_light'),
sep = ',') %>%
separate(common_hsl_2_val,
into = c('common_hsl_2_hue', 'common_hsl_2_sat', 'common_hsl_2_light'),
sep = ',') %>%
separate(common_hsl_3_val,
into = c('common_hsl_3_hue', 'common_hsl_3_sat', 'common_hsl_3_light'),
sep = ',') %>%
separate(common_hsl_4_val,
into = c('common_hsl_4_hue', 'common_hsl_4_sat', 'common_hsl_4_light'),
sep = ',')
… I probably should have saved those independently during the python image processing stage.
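One possible way to avoid the character columns mentioned in the Editor's Note is to clean only the text columns and let `separate()` convert the pieces. The following is a sketch on a two-row toy data frame with made-up values, not a drop-in replacement for the chain above.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy rows with the same string-encoded tuples and lists (values are made up)
toy <- data.frame(
  sub_img    = c(0, 5),
  r_mean     = c(120.5, 98.25),
  center_rgb = c("(12, 200, 34)", "(0, 0, 255)"),
  center_hsl = c("[127, 0.89, 0.42]", "[240, 1.0, 0.5]"),
  stringsAsFactors = FALSE
)

toy_split <- toy %>%
  # strip brackets and parens from character columns only,
  # so numeric columns such as r_mean keep their type
  mutate(across(where(is.character), ~ str_remove_all(.x, "[\\[\\]()]"))) %>%
  # convert = TRUE turns the separated pieces into numbers
  separate(center_rgb, into = c("center_r", "center_g", "center_b"),
           sep = ",\\s*", convert = TRUE) %>%
  separate(center_hsl, into = c("center_hue", "center_sat", "center_light"),
           sep = ",\\s*", convert = TRUE)

sapply(toy_split, class)
```

Note that numeric columns change how plotly maps them: a numeric `sub_img` would be colored on a continuous scale unless it is converted to a factor first.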
To be added as needed:
custom_colors = c('#de4bcd','#8f2411', '#f5b049', '#9c8a1c', '#b3e32d', '#1a9615', '#05ffb4','#055eab', '#6b63ff','#45008f')
fig <- plot_ly(imgsd_split, x=~center_r, y= ~center_g, z= ~center_b,
type = "scatter3d", mode="markers", size = 1, color = ~sub_img, colors = custom_colors)
fig
Observation: Strong correlation to neutral values of all intensities.
fig <- plot_ly(imgsd_split, x=~r_mean, y= ~g_mean, z= ~b_mean,
type = "scatter3d", mode="markers", size = 2, color = ~sub_img, colors=custom_colors)
fig
Observation: An even stronger tendency towards neutrality than strictly center pixels, but still a notable quantity of outliers. Most outliers have high blue values.
fig <- plot_ly(imgsd_split, x=~hue_mean_val, y= ~sat_mean_val, z= ~light_mean_val,
type = "scatter3d", mode="markers", size = 1, color = ~sub_img, colors= custom_colors)
fig
Observation: Clustering tendency is towards low saturation (strongly) and high lightness (weakly). Hue is strongly clustered in red-orange values. This does not indicate visibly red pixels.
fig <- plot_ly(imgsd_split, x=~post_top_hue, y= ~post_top_sat, z= ~post_top_light,
type = "scatter3d", mode="markers", size = 2, color = ~sub_img, colors = custom_colors)
fig
Observation: Another strong tendency for low saturation (below the threshold of clearly being a specific color). There is almost a clean line between 0 saturation/0 light and 1 saturation/1 light, with a distinct majority of pixels falling on the side of lower saturation/higher brightness.
Within that triangular prism, there are hue bands under 50, at 220, and over 350, which roughly correspond to red (both the high and low ends of the hue scale) and blue (mid-scale). I have a theory this is due to darker pixels being arbitrarily assigned pure-ish red/blue values through the technology behind image sensors.
fig <- plot_ly(imgsd_split,
x=~common_hsl_1_hue,
y= ~common_hsl_1_sat,
z= ~common_hsl_1_light,
type = "scatter3d", mode="markers",
size = 2, color = ~sub_img, colors = custom_colors)
fig
Observations: Once again, the hue values have strong tendencies to clump at 220 and 40. There is also a noticeable curve at the corner of 0 saturation/0 lightness.
fig <- plot_ly(imgsd_split,
x=~common_hsl_2_hue,
y= ~common_hsl_2_sat,
z= ~common_hsl_2_light,
type = "scatter3d", mode="markers",
size = 2, color = ~sub_img, colors = custom_colors)
fig
fig <- plot_ly(imgsd_split,
x=~common_hsl_3_hue,
y= ~common_hsl_3_sat,
z= ~common_hsl_3_light,
type = "scatter3d", mode="markers",
size = 2, color = ~sub_img, colors = custom_colors)
fig
Warning: Ignoring 2 observations
fig <- plot_ly(imgsd_split,
x=~common_hsl_4_hue,
y= ~common_hsl_4_sat,
z= ~common_hsl_4_light,
type = "scatter3d", mode="markers",
size = 2, color = ~sub_img, colors = custom_colors)
fig
Warning: Ignoring 2 observations
Observations: Trends are staying consistent across all top 4 colors, including the same patterns in outliers. This points to the possibility that the top 4 values for each image are similar to each other.
One point of interest: the curve at 0 saturation/0 lightness also seems to exist at 0 saturation/1 lightness, though not as strongly.
imgsd_split$common_hsl_4_hue <- as.numeric(imgsd_split$common_hsl_4_hue)
imgsd_split$common_hsl_4_sat <- as.numeric(imgsd_split$common_hsl_4_sat)
fig <- plot_ly(imgsd_split,
x=~common_hsl_4_hue,
y= ~common_hsl_4_sat,
type = "scatter", mode="markers", size = 2, color = ~sub_img, colors = custom_colors)
fig
Warning: Ignoring 2 observations
Observation: Here we can more closely examine the hue trends present in the images. I’m fascinated by the cluster at 105 hue/60-80 saturation which is present in many sub-images but not 3, 9, or 6.
imgsd_split$gen_bright_count <- as.integer(imgsd_split$gen_bright_count)
imgsd_split$gen_dark_count <- as.integer(imgsd_split$gen_dark_count)
fig <- plot_ly(imgsd_split,
x=~gen_bright_count,
y= ~gen_dark_count,
type = "scatter", mode="markers",
size = 2, color = ~sub_img, colors = custom_colors)
fig
imgsd_split$sat_min_val <- as.numeric(imgsd_split$sat_min_val)
imgsd_split$sat_max_val <- as.numeric(imgsd_split$sat_max_val)
fig <- plot_ly(imgsd_split,
x=~sat_min_val,
y= ~sat_max_val,
type = "scatter", mode="markers",
size = 2, color = ~sub_img, colors = custom_colors)
fig
-trim exif down to id/date-time/software/brightness info
Left join to add date/software data from exif to img_data
*do facets with sub_img numbers*
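The planned join could look roughly like the sketch below. It is a minimal illustration on made-up rows: the column names follow the tidied data above, while the choice of which EXIF columns to keep is an assumption.

```r
library(dplyr)

# Toy stand-ins for exif_tidy and imgsd_split (values are made up)
exif_toy <- data.frame(
  flickr_id        = c(101, 102),
  month            = c("06", "12"),
  software         = c("17.0", "16.6"),
  brightness_value = c(7.2, 1.4),
  stringsAsFactors = FALSE
)
img_toy <- data.frame(
  using_id     = c(101, 101, 102),
  sub_img      = c(0, 5, 0),
  sat_mean_val = c(0.31, 0.44, 0.12)
)

# Keep one row per sub-image and attach the matching EXIF fields
joined <- img_toy %>%
  left_join(
    select(exif_toy, flickr_id, month, software, brightness_value),
    by = c("using_id" = "flickr_id")
  )

joined
```

Because the join is keyed on the photo ID, every sub-image row of a photo receives the same date and software values.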